81 research outputs found

POIMs: positional oligomer importance matrices—understanding support vector machine-based signal detectors

Author: Alexander Zien
Arkhipova
Barash
Ben-Gal
Burke
Chen
Down
Graber
Graf
Gunnar Rätsch
Harris
Lanckriet
Leslie
Meinicke
Ohler
Petra Philips
Rätsch
Saeys
Schölkopf
Sonnenburg
Sören Sonnenburg
Vapnik
Zien
Üstün
Publication venue: Oxford University Press
Publication date: 01/01/2008

Motivation: At the heart of many important bioinformatics problems, such as gene finding and function prediction, is the classification of biological sequences. Frequently the most accurate classifiers are obtained by training support vector machines (SVMs) with complex sequence kernels. However, a cumbersome shortcoming of SVMs is that their learned decision rules are very hard for humans to understand and cannot easily be related to biological facts.

Improving the Caenorhabditis elegans Genome Annotation Using Machine Learning

Author: Bernhard Schölkopf
Gunnar Rätsch
Hanh Witte
Jagan Srinivasan
Klaus-R Müller
Ralf-J Sommer
Sören Sonnenburg
The Caenorhabditis elegans sequencing consortium
Uwe Ohler
Publication venue: Public Library of Science (PLoS)
Publication date: 01/01/2007

For modern biology, precise genome annotations are of prime importance, as they allow the accurate definition of genic regions. We employ state-of-the-art machine learning methods to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA, utilizing recent advances in support vector machines and label sequence learning. In 87% (coding and untranslated regions) and 95% (coding regions only) of all genes tested in several out-of-sample evaluations, our method correctly identified all exons and introns. Notably, only 37% and 50%, respectively, of the presently unconfirmed genes in the C. elegans genome annotation agree with our predictions, thus we hypothesize that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation of the Wormbase WS120 annotation [1] of C. elegans reveals that splice form predictions on unconfirmed genes in WS120 are inaccurate in about 18% of the considered cases, while our predictions deviate from the truth only in 10%–13%. We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions. While our method correctly predicted 75% of those cases, the standard annotation was never completely correct. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter. We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology

CiteSeerX

Public Library of Science (PLOS)

Crossref

Fraunhofer-ePrints

Directory of Open Access Journals

PubMed Central

Caltech Authors

MPG.PuRe

Accurate splice site prediction using support vector machines

Author: Bmc Bioinformatics
Gabriele Schweikert
Gunnar Rätsch
Jonas Behr
Petra Philips
Sören Sonnenburg
Publication venue: BioMed Central
Publication date: 01/01/2007

Proceeding

CiteSeerX

Crossref

Springer - Publisher Connector

Fraunhofer-ePrints

PubMed Central

MPG.PuRe

Sample adaptive multiple kernel learning for failure prediction of railway points

Author: Afkanpour Arash
Gönen Mehmet
Ishak Muhammad Fitri
Kloft Marius
Le Quoc
Lei Yunwen
Li Xiang
Liu Xinwang
Rakotomamonjy Alain
Shen Yanning
Sonnenburg Sören
Tao Hanqing
Xu Zenglin
Publication venue: Association for Computing Machinery (ACM)
Publication date: 02/07/2019

Railway points are among the key components of railway infrastructure. As a part of signal equipment, points control the routes of trains at railway junctions, having a significant impact on the reliability, capacity, and punctuality of rail transport. Meanwhile, they are also one of the most fragile parts in railway systems. Points failures cause a large portion of railway incidents. Traditionally, maintenance of points is based on a fixed time interval or raised after the equipment failures. Instead, it would be of great value if we could forecast points' failures and take action beforehand, minimising any negative effect. To date, most of the existing prediction methods are either lab-based or rely on specially installed sensors, which makes them infeasible for large-scale implementation. Besides, they often use data from only one source. We, therefore, explore a new way that integrates multi-source data which are ready to hand to fulfil this task. We conducted our case study based on the Sydney Trains rail network, which is an extensive network of passenger and freight railways. Unfortunately, the real-world data are usually incomplete due to various reasons, e.g., faults in the database, operational errors or transmission faults. Besides, railway points differ in their locations, types and some other properties, which means it is hard to use a unified model to predict their failures. Aiming at this challenging task, we firstly constructed a dataset from multiple sources and selected key features with the help of domain experts. In this paper, we formulate our prediction task as a multiple kernel learning problem with missing kernels. We present a robust multiple kernel learning algorithm for predicting points failures. Our model takes into account the missing pattern of data as well as the inherent variance on different sets of railway points. Extensive experiments demonstrate the superiority of our algorithm compared with other state-of-the-art methods.

arXiv.org e-Print Archive

Crossref

OPUS - University of Technology Sydney

Support Vector Machines and Kernels for Computational Biology

ISSN: 1553-734X, ISSN: 1553-735

Repository for Publications and Research Data

Crossref

Fraunhofer-ePrints

Directory of Open Access Journals

PubMed Central

MPG.PuRe

Prediction of donor splice sites using random forest with a new sequence encoding approach

Author: A Baten
A Dehzangi
A Liaw
A Zien
Atmakuri Ramakrishna Rao
BJ Blencowe
BJ Lam
C Bergmeir
C Burge
C Cortes
C Weihs
D Hand
D Meyer
G Yeo
H Drucker
J Huang
J Rajapakse
J Zhu
JL Li
L Breiman
M Khalilia
M Pertea
M Stone
MG Reese
MM Yin
MQ Zhang
N Sheth
P Jain
P Pollastro
Prabina Kumar Meher
R Staden
S Haykin
S Sören Sonnenburg
SE Hamby
T Mitchell
Tanmaya Kumar Sahu
TM Chen
WN Venables
X Roca
X Zhao
XF Zhang
Z Dominski
Publication venue: Springer Science and Business Media LLC

Crossref

Large scale multiple kernel learning

Author: Bernhard Schölkopf
Emilio Parrado-Hernández
P. Bennett
Sören Sonnenburg
Publication date: 01/07/2006

While classical kernel-based learning algorithms are based on a single kernel, in practice it is often desirable to use multiple kernels. Lanckriet et al. (2004) considered conic combinations of kernel matrices for classification, leading to a convex quadratically constrained quadratic program. We show that it can be rewritten as a semi-infinite linear program that can be efficiently solved by recycling the standard SVM implementations. Moreover, we generalize the formulation and our method to a larger class of problems, including regression and one-class classification. Experimental results show that the proposed algorithm works for hundreds of thousands of examples or hundreds of kernels to be combined, and helps with automatic model selection, improving the interpretability of the learning result. In a second part we discuss general speed-up mechanisms for SVMs, especially when used with sparse feature maps as they appear for string kernels, allowing us to train a string kernel SVM on a 10 million real-world splice data set from computational biology. We integrated multiple kernel learning in our machine learning toolbox SHOGUN for which the source code is publicly available a

CiteSeerX

Fraunhofer-ePrints

MPG.PuRe

Large scale genomic sequence svm classifiers

Author: Bernhard Schölkopf
Sören Sonnenburg
Publication venue: ACM Press

In genomic sequence analysis tasks like splice site recognition or promoter identification, large amounts of training sequences are available, and indeed needed to achieve sufficiently high classification performance. In this work we study two recently proposed and successfully used kernels, namely the Spectrum kernel and the Weighted Degree (WD) kernel. In particular, we suggest several extensions using suffix trees and modifications of an SMO-like SVM training algorithm in order to accelerate the training of the SVMs and their evaluation on test sequences. Our simulations show that for the Spectrum kernel and the WD kernel, large-scale SVM training can be accelerated by factors of 20 and 4, respectively, while using much less memory (e.g. no kernel caching). The evaluation on new sequences is often several thousand times faster using the new techniques (depending on the number of support vectors). Our method allows us to train on sets as large as one million sequences.
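
For readers unfamiliar with the two kernels named in this abstract, the following is a minimal, unoptimized sketch of their textbook definitions. It is not the accelerated suffix-tree method the abstract describes; the function names, the default parameters, and the weighting beta_d = 2(D - d + 1) / (D(D + 1)) for the Weighted Degree kernel are assumptions taken from the commonly cited formulation, not from this record.

```python
from collections import Counter


def spectrum_kernel(x: str, y: str, k: int = 3) -> int:
    """Spectrum kernel: inner product of the k-mer count vectors of x and y."""
    counts_x = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    counts_y = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    # Only k-mers present in both sequences contribute to the sum.
    return sum(c * counts_y[kmer] for kmer, c in counts_x.items())


def weighted_degree_kernel(x: str, y: str, degree: int = 3) -> float:
    """Weighted Degree kernel: position-wise matches of substrings of length 1..degree.

    Assumes equal-length sequences and the weighting
    beta_d = 2 * (degree - d + 1) / (degree * (degree + 1)).
    """
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    total = 0.0
    for d in range(1, degree + 1):
        beta = 2.0 * (degree - d + 1) / (degree * (degree + 1))
        # Count positions where the length-d substrings of x and y coincide.
        matches = sum(x[i:i + d] == y[i:i + d] for i in range(len(x) - d + 1))
        total += beta * matches
    return total
```

The Spectrum kernel ignores where a k-mer occurs, while the Weighted Degree kernel only counts matches at the same position, which is why the latter is typically applied to fixed-length windows around a candidate signal such as a splice site.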

CiteSeerX

Abstract

Author: Konrad Rieck
Pavel Laskov
Sören Sonnenburg
Publication date: 01/01/2007

We propose a generic algorithm for computation of similarity measures for sequential data. The algorithm uses generalized suffix trees for efficient calculation of various kernel, distance and non-metric similarity functions. Its worst-case run-time is linear in the length of sequences and independent of the underlying embedding language, which can cover words, k-grams or all contained subsequences. Experiments with network intrusion detection, DNA analysis and text processing applications demonstrate the utility of distances and similarity coefficients for sequences as alternatives to classical kernel functions.
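
The idea of one embedding feeding several similarity measures can be illustrated with a small sketch. It deliberately swaps the generalized suffix trees of the abstract for plain hash-map counts, so it does not reproduce the linear worst-case run-time claimed there; all function names are hypothetical, and the three measures shown (a linear kernel, the Manhattan distance, and the Jaccard coefficient) are merely examples of a kernel, a distance, and a similarity coefficient.

```python
from collections import Counter
from typing import List


def kgrams(s: str, k: int) -> List[str]:
    """Embedding language of k-grams: all contiguous substrings of length k."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def words(s: str) -> List[str]:
    """Embedding language of words: whitespace-delimited tokens."""
    return s.split()


def linear_kernel(a: Counter, b: Counter) -> int:
    """Inner product of two token-count vectors."""
    return sum(c * b[t] for t, c in a.items())


def manhattan_distance(a: Counter, b: Counter) -> int:
    """Sum of absolute count differences over all tokens seen in either input."""
    return sum(abs(a[t] - b[t]) for t in set(a) | set(b))


def jaccard_coefficient(a: Counter, b: Counter) -> float:
    """Shared distinct tokens divided by all distinct tokens."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 1.0


# Usage: embed both sequences once, then evaluate any measure on the counts.
x = Counter(kgrams("abab", 2))
y = Counter(kgrams("abba", 2))
scores = (linear_kernel(x, y), manhattan_distance(x, y), jaccard_coefficient(x, y))
```

Switching the embedding language only means replacing `kgrams` with `words` (or another tokenizer); the similarity functions stay unchanged.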

CiteSeerX

Fraunhofer-ePrints